Description of the Problem

We were given 30 State of the Nation Address (SONA) speeches from 1994 to 2018 to analyse. The specific objectives are to:

1. Infer sentiment and changes in sentiment over time
2. Describe the topics that emerge
3. Predict the President from a given sentence of text
4. Evaluate out-of-sample performance of the predictions

Approach

We collaborated using the following GitHub location: https://github.com/samperumal/dsi-assign2.

We initially split the work as follows and each of us created a folder with our names to push our work to for others to view:
- Neural Net - Sam and Merve
- Bag of Words - Merve
- Topic Modelling - Vanessa
- Sentiment Analysis - Audrey

We presented our work to each other and made suggestions for improvement. Before diving into any prediction, we felt it was important to do an Exploratory Data Analysis (EDA) to get a high-level overview of the dataset. This was done by Audrey.

The initial results from the Neural Net gave 65% accuracy on the validation set. We wanted to feed the results of the Topic Modelling and Sentiment Analysis into the Neural Net to see if it would improve results, so we needed to understand from each other what the output of these two methods was and what input the neural net required. Getting the data into a usable format took some discussion and a few iterations.

Given the low accuracy of the neural net (NN), we also tried a Convolutional Neural Net (CNN). Sam got the initial model working, Merve made improvements, and Vanessa tuned the hyperparameters. …need more here…

The CNN did not provide much improvement over the NN, so we also tried a Recurrent Neural Net (RNN), which takes in sequences of data so that the order of the words is also taken into account in the model. …need more here…

Data Preparation

Initially we each performed our own import of the data, splitting out the year and president and tokenising the text. We soon realised there was duplication of effort here, and that different naming conventions made it difficult to collaborate and use each other’s output. In addition, Sam noticed that some of the data was not being read in because of special characters, and that sentences were not being tokenised correctly for various reasons. He therefore became responsible for the data clean-up (preprocessing) and for outputting a .RData file that we could all then use to rerun our work.

The data as provided consisted of 30 text files, with filenames encoding the president’s name, the year of the speech, and whether it was pre- or post-election (absent in non-election years). In working through the files, we discovered that two files were identical; this was corrected in the data source with a replacement file. Additionally, in reading the files we identified 3 files containing one or more bytes that caused issues with the standard R file IO routines. Specifically, 1 file had a leading Byte-Order-Mark (BOM), which is characteristic of files created on the Windows operating system, and 2 other files had invalid unicode characters, suggesting a speech-to-text processing application was used and experienced either transmission or storage errors. In all cases the offending characters were simply removed from the input files.

Having fixed basic read issues, we then examined the content of each file and the simplistic tokenisation achieved by applying unnest_tokens to the raw lines read from the files. Several issues were uncovered, and in each case a regular expression was created to correct the issue in the raw lines:
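A minimal sketch of this kind of regex clean-up in base R; the specific patterns below are illustrative assumptions, not the project’s actual rules:

```r
# Illustrative clean-up of raw lines before tokenisation.
# Patterns here are assumptions, not the exact rules used in the project.
clean_lines <- function(lines) {
  lines <- gsub("^\ufeff", "", lines)           # strip a leading Byte-Order-Mark
  lines <- iconv(lines, to = "UTF-8", sub = "") # drop invalid unicode bytes
  lines <- gsub("\\s+", " ", lines)             # collapse runs of whitespace
  trimws(lines)
}

clean_lines(c("\ufeffHonourable  Speaker,", "  Fellow South Africans.  "))
```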

Having fixed the text to allow correct sentence tokenisation, and applied the unnest_tokens function, we then determined a unique ID for each sentence by applying a hash digest function to the sentence text. This unique ID allowed everyone to work on the same data with confidence, and also enabled us to detect 72 sentences that appeared identically in at least 2 speeches. As these duplicates would potentially bias the analysis and training, all instances of duplicates were removed from the dataset.
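The duplicate-removal step can be sketched in base R. The real project keyed sentences by a hash digest; using the sentence text itself as the key is equivalent for finding exact duplicates (the toy data below is illustrative):

```r
# Detect and drop every instance of a sentence that appears more than once.
sentences <- data.frame(
  speech   = c("mandela_1994", "mbeki_2004", "zuma_2014", "zuma_2014"),
  sentence = c("i thank you.", "i thank you.",
               "we will create jobs.", "honourable members."),
  stringsAsFactors = FALSE
)

dup_key  <- sentences$sentence
# duplicated() alone misses the first instance; combining with fromLast
# flags every copy, so all instances are removed (as in the project).
all_dups <- duplicated(dup_key) | duplicated(dup_key, fromLast = TRUE)
deduped  <- sentences[!all_dups, ]
nrow(deduped)  # only the two unique sentences remain
```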

One final note is that each speech starts with a very similar boiler plate referencing various attendees to the SONA in a single, run-on sentence. We believe this header does not add significantly to the content of the speech, and so we excluded all instances across all speeches.

Change in number of sentences per president before and after filtering.


The figure above shows the change in number of sentences per president after filtering. On the whole there are more sentences per president, with only a single reduction. Additionally, the highest increases are associated with the files where read-errors prevented us from previously reading the entire file. This change is equally evident in the boxplots below, which show the change in distribution per president of words and characters per sentence.

Overall there is a much tighter grouping of sentences, with less variation and more consistent lengths, which is useful for techniques that depend on equal-length inputs. The final histogram below shows the number of sentences per year/president after filtering; it still bears the same basic shape as before filtering, but with a better profile.

Data Split and Sampling

For all group work, we separated our full dataset into a random sample of 80% training and 20% validation data, which was saved into a common .RData file. This ensured consistency across the data we were working on, so that we could use each other’s work and compare results more easily.
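The split can be sketched as follows; the data frame, column names and seed here are illustrative assumptions, not the project’s exact code:

```r
# Sketch of the 80/20 train/validation split on a toy sentence table.
set.seed(2018)  # a fixed seed would let everyone regenerate the same split
sona <- data.frame(id = 1:1000,
                   president = sample(c("Mandela", "Mbeki", "Zuma"),
                                      1000, replace = TRUE))

train_idx  <- sample(nrow(sona), size = floor(0.8 * nrow(sona)))
train      <- sona[train_idx, ]
validation <- sona[-train_idx, ]

c(train = nrow(train), validation = nrow(validation))  # 800 / 200
# save(train, validation, file = "sona_split.RData")   # shared via the repo
```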

The graphs above make it clear that our data is also very unbalanced. In an attempt to correct for this, we oversampled the training dataset with replacement to ensure an equal number of sentences per president. Training was attempted using both balanced and unbalanced training data, but it did not appear to make much difference. Balancing was applied to the training dataset only, so that no duplicated sentences could end up in the validation set and skew the validation error.
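The balancing step can be sketched in base R; the function and column names are illustrative assumptions:

```r
# Oversample with replacement so each president contributes the same
# number of training sentences as the largest class.
balance_by_president <- function(df) {
  target <- max(table(df$president))  # size of the largest class
  idx <- unlist(lapply(split(seq_len(nrow(df)), df$president), function(rows) {
    sample(rows, size = target, replace = TRUE)
  }))
  df[idx, , drop = FALSE]
}

train    <- data.frame(president = rep(c("deKlerk", "Mbeki"), times = c(10, 100)))
balanced <- balance_by_president(train)
table(balanced$president)  # 100 rows for each president
```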

Overview of the dataset

Each president made a certain number of SONA speeches, depending on their term in office and whether there was one speech that year or two in an election year (pre- and post-election). Since the data depends on their term in office, it is unbalanced. Sentence counts per president after cleaning the data are:

## [1] "President sentence counts:"
## 
##   deKlerk   Mandela     Mbeki Motlanthe Ramaphosa      Zuma 
##       103      1879      2803       346       240      2697
## [1] "Baseline_accuracies"
## 
##   deKlerk   Mandela     Mbeki Motlanthe Ramaphosa      Zuma 
##  1.276648 23.289539 34.742191  4.288547  2.974715 33.428359

Let’s understand the number of words used by each President and how this varies across each SONA speech.

Average number of words used per President

We create a metric called “avg_words”, which is simply the total number of words across all SONA speeches made by a particular president, divided by the number of SONA speeches that president made.

## Joining, by = "president"
Average number of words used per President

| president | num_words | num_speeches | avg_words |
|-----------|----------:|-------------:|----------:|
| Mbeki     |     29952 |            9 |      3328 |
| Motlanthe |      3206 |            1 |      3206 |
| Mandela   |     16801 |            6 |      2800 |
| Zuma      |     23554 |            9 |      2617 |
| Ramaphosa |      2258 |            1 |      2258 |
| deKlerk   |       783 |            1 |       783 |
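The avg_words computation can be sketched in base R (the report used a tidyverse join, as the “## Joining” message shows; the per-speech counts below are illustrative):

```r
# Toy per-speech word counts; avg_words = total words / number of speeches.
words_per_speech <- data.frame(
  president = c("Mbeki", "Mbeki", "Mandela"),
  num_words = c(3300, 3356, 2800)
)

num_words    <- tapply(words_per_speech$num_words, words_per_speech$president, sum)
num_speeches <- tapply(words_per_speech$num_words, words_per_speech$president, length)
avg_words    <- num_words / num_speeches
avg_words  # Mandela 2800, Mbeki 3328
```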

Insights:

On average, Mbeki used the most words in his SONA speeches, followed by Motlanthe, while de Klerk used the fewest. Mandela and Zuma rank in the middle of their peers. The current president (Ramaphosa) used fewer words than all of his post-1994 peers.

Number of words used per SONA

Insights:

Of the 3 presidents that have made more than 1 SONA speech, Mbeki used more words on average than both Mandela and Zuma and the variance in the number of words used per SONA speech is also higher for Mbeki. In 2004, which was an election year, the average number of words Mbeki used was lower in both his pre- and post-election speeches. Towards the end of his term, his average number of words also dropped off. The data suggests that perhaps Mbeki’s average number of words is correlated with his confidence in being re-elected President.

Common words used across all SONA speeches

Insights:

….

Common bigrams used across all SONA speeches

Insights:

….

Lexical Diversity per President

Lexical diversity refers to the number of unique words used in each SONA.


Insights:

The number of unique words per SONA ranges from about 700 with de Klerk in 1994 to over 2500 with Mandela in his post election speech of 1999. Mbeki’s post election speech of 2004 and Zuma’s post election speech of 2014 also got close to the 2500 mark.

It’s interesting that whilst the trend in the number of unique words used was most often upwards for Mandela, Mbeki and Zuma both show a mostly upward trend in the lead-up to an election year, followed by a mostly downward trend after nearing the 2500 unique words mark in their post-election speeches.

If we exclude the post election speeches, the number of unique words used by Mbeki during his term from 2000 to 2008 averages just under 2000 whereas the number of unique words used by Zuma during his term from 2009 to 2017 averages just over 1500.

Lexical Density per President

Lexical density refers to the number of unique words used in each SONA divided by the total number of words; a low value is an indicator of word repetition.
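Both metrics can be computed directly from the token list; a base-R sketch on a toy “speech” (the report computed these per SONA from the tokenised sentences):

```r
# Lexical diversity = count of unique words; density = unique / total.
words <- c("we", "will", "create", "jobs", "and", "we", "will", "build")

lexical_diversity <- length(unique(words))
lexical_density   <- lexical_diversity / length(words)

c(diversity = lexical_diversity, density = round(lexical_density, 2))
# 6 unique words out of 8; density 0.75, i.e. 25% of words are repeats
```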


Insights:

De Klerk repeated over 30% of his words in his 1994 pre-election SONA speech. On average, Mandela repeated about 25% of words in each of his SONA speeches, and this reduced to about 20% in his post-election speech of 1999. Mbeki’s repetition rate was about 23%, reducing to 20% in his post-election speech of 2004. Zuma’s repetition rate is over 30%, with the exception of his post-election speech of 2014 at about 23%.

Results

Sentiment Analysis

Analysis of Sentiment using Single Words
Sentiment Analysis using “bing” lexicon

The “bing” lexicon encodes words as either “positive” or “negative”. However, not all words used in the SONA speeches appear in the lexicon, so we need to adjust for this.

Sentiment per President

Let’s count how many “positive” and “negative” words are used by each president across all their SONA speeches and create a metric called “sentiment”, which is simply the total number of positive words minus the total number of negative words. We then adjust for the total number of lexicon words used in the “sentiment_score” metric.

## Joining, by = "word"
Sentiment Score per President

| president | negative | positive | sentiment | sentiment_score |
|-----------|---------:|---------:|----------:|----------------:|
| Zuma      |      788 |     1466 |       678 |           30.08 |
| Ramaphosa |      102 |      181 |        79 |           27.92 |
| Mbeki     |     1314 |     2287 |       973 |           27.02 |
| Motlanthe |      180 |      263 |        83 |           18.74 |
| Mandela   |     1036 |     1434 |       398 |           16.11 |
| deKlerk   |       64 |       58 |        -6 |           -4.92 |
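The two metrics can be reproduced from the word counts; the sketch below uses Zuma’s totals from the table above, assuming the adjustment is (positive − negative) / (positive + negative) × 100, which matches the reported scores:

```r
# sentiment = positive - negative; sentiment_score adjusts for the total
# number of lexicon words matched.
positive <- 1466
negative <- 788

sentiment       <- positive - negative
sentiment_score <- round(sentiment / (positive + negative) * 100, 2)

c(sentiment = sentiment, sentiment_score = sentiment_score)  # 678 and 30.08
```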

Insights:

Of the 3 presidents that have made more than 1 SONA speech, Zuma has the highest sentiment score, followed by Mbeki and then Mandela. Zuma’s sentiment score is nearly double Mandela’s. It’s interesting that the current President, Ramaphosa, has the second highest sentiment score, not far behind Zuma and only slightly ahead of Mbeki.

What are the 10 positive words most frequently used by each president?
## Joining, by = "word"

Insights:

De Klerk’s most used words were “freedom”, “peaceful” and “support”, and at least 2 of these 3 come up in every president’s most used words. Mandela’s most used words also include “progress”, “improve”, “reconciliation” and “commitment”, which are all words indicating repair and a move towards something better. Mbeki uses many of the same words but also introduces “empowerment”, which is carried through by Zuma and Ramaphosa, and “success”, which is carried through by Zuma. These words suggest progress in the move towards repair or something better, first spoken about by Mandela. Ramaphosa also introduces the words “confidence”, “effectively”, “enhance” and “efficient”, which are commonly seen in a business context and have not shown up in any other SA president’s top 10 most frequently used words in a SONA since 1994.

Which of the positive words most frequently used are common across presidents?
## Joining, by = "word"
## Joining, by = "president"

Insights:

Common positive words across post-1994 presidents include: “freedom”, “regard”, “support”, “improve” and “progress”. Words introduced by Mandela and unique to his speeches are: “restructuring”, “reconciliation”, “commitment”, “contribution” and “succeed”. Mbeki introduces the words “empowerment”, “comprehensive”, “integrated” and “improving” into the top words used, and these are unique to his speeches. Zuma uses the words “success”, “reform” and “pleased” frequently where other presidents do not. Ramaphosa introduces the words “significant”, “productive”, “confidence” and “effectively”, which have not yet been seen in any other SA president’s top 10 most frequently used words in a SONA since 1994.

What are the 10 negative words most used by each president?
## Joining, by = "word"

Insights:

Common negative words pre 1994 include: “concerns”/“concern”/“concerned”, “unconstitutional”, “illusion”, “hopeless”, “disagree”, “deprive”, “conflict”, and “boycott”.

Common negative words post 1994 include: “corruption”, “crime”/“criminal”, “poverty”/“poor”, “inequality”, “issue”/“issues” and “crisis”.

A negative word introduced by and unique to Mandela’s top 10 is “struggle”. Mbeki is the only president with the word “racism” in his top 10 negative words. Motlanthe has “conflict” in his top 10 which no other president does. Zuma has “rail” which likely refers to the railway system and does have negative connotations for South Africa. Both Zuma and Ramaphosa use the word “difficult” a lot. Ramaphosa introduces the word “expropriation” into the top 10 for the first time amongst his peers.

How many of the negative words most used were used by each president?
## Joining, by = "word"
## Joining, by = "president"

Insights:

The interpretation is much the same as before. Note the clear separation between the top 10 negative words used pre and post 1994 elections, indicative of the pre and post apartheid regimes.

What proportion of words used are positive vs negative?
## Joining, by = "word"
## Joining, by = "year"

Insights:

The 2 vertical black lines are drawn at 60% and 70% positivity rates. In the majority of years, SONA speeches fall within this range of positivity; however, there are a few more negative speeches in earlier years and a few more positive speeches in later years.

Change in Sentiment over time

Insights:

The trend appears to be more positive and less negative over time but how can we be sure?

We will test whether negative sentiment is increasing or decreasing over time, and then whether positive sentiment is increasing or decreasing. We use a binomial model because the frequencies lie between 0 and 1. Finally, we test whether average sentiment is increasing or decreasing using a linear model.
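The binomial test can be sketched as follows; the yearly frequencies here are simulated, whereas the report fits the same model form to the observed bing frequencies:

```r
# Toy yearly positivity frequencies with a slight upward trend.
set.seed(1)
years <- 1994:2018
freq  <- pmin(pmax(0.4 + 0.004 * (years - 1994) + rnorm(25, sd = 0.02), 0), 1)

# Fitting proportions rather than 0/1 outcomes triggers the
# "non-integer #successes" warning seen in the report output;
# family = "quasibinomial" would avoid it.
fit <- suppressWarnings(glm(freq ~ years, family = "binomial"))
coef(fit)["years"]  # slope of the year term on the logit scale
```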

Is negative sentiment increasing over time?
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
## 
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative, 
##     sentiment == "negative"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21779  -0.05759   0.01224   0.07113   0.19692  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)       25.88105  115.18655   0.225    0.822
## as.numeric(year)  -0.01316    0.05743  -0.229    0.819
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.39087  on 24  degrees of freedom
## Residual deviance: 0.33826  on 23  degrees of freedom
## AIC: 27.484
## 
## Number of Fisher Scoring iterations: 3

Insights:

The slope is negative but the beta of the year variable is not significant so we cannot conclude that negative sentiment is decreasing over time.

Is positive sentiment increasing over time?
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
## 
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative, 
##     sentiment == "positive"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.19692  -0.07113  -0.01224   0.05759   0.21779  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -25.88105  115.18655  -0.225    0.822
## as.numeric(year)   0.01316    0.05743   0.229    0.819
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.39087  on 24  degrees of freedom
## Residual deviance: 0.33826  on 23  degrees of freedom
## AIC: 27.484
## 
## Number of Fisher Scoring iterations: 3

Insights:

The slope is positive but the beta of the year variable is not significant so we cannot conclude that positive sentiment is increasing over time.

Is average sentiment increasing over time?
## Joining, by = "word"
## 
## Call:
## glm(formula = avg_sentiment ~ as.numeric(year), family = "gaussian", 
##     data = sentiments_per_year)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -21.252   -7.877   -1.282    8.584   21.444  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -1369.9476   652.4104  -2.100   0.0460 *
## as.numeric(year)     0.6952     0.3253   2.137   0.0425 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 153.4216)
## 
##     Null deviance: 4536.4  on 26  degrees of freedom
## Residual deviance: 3835.5  on 25  degrees of freedom
## AIC: 216.44
## 
## Number of Fisher Scoring iterations: 2

Insights:

The slope is positive and the beta of the year variable is significant at the 5% level, so we can conclude that average sentiment is increasing over time.

But we need to be cautious with this interpretation: the “bing” lexicon has more than double the number of negative words compared to positive words, which could be influencing the results, and SONA speeches may in fact be more positive than they appear to be.

## 
## negative positive 
##     4782     2006
Distribution of “bing” Sentiment per President
## Joining, by = "word"

Insights:

Apart from the last 2 presidents, Ramaphosa and Zuma, the presidents are in time order. We can see that other than Motlanthe, the trend is an increasing average sentiment over time but at a decreasing rate. The interquartile range of Mbeki is smaller than Zuma’s which is smaller than Mandela’s.

Change in “bing” Sentiment over time
## Joining, by = "word"

Insights:

Average sentiment is the proportion of positive words out of all the words matched to the “bing” lexicon. Mandela shows a very erratic average sentiment, ranging from 0 to over 25. Mbeki’s and Zuma’s average sentiment mostly ranges between 25 and 50, with the exception of a few years such as 2000, 2008, 2012 and 2017.

Sentiment Analysis using “afinn” lexicon (scale from -5 negative to +5 positive)
## Joining, by = "word"
afinn Sentiment

| score |    n | weighted_score |
|------:|-----:|---------------:|
|    -5 |    2 |            -10 |
|    -4 |   27 |           -108 |
|    -3 |  689 |          -2067 |
|    -2 | 1059 |          -2118 |
|    -1 |  867 |           -867 |
|     1 | 2422 |           2422 |
|     2 | 3525 |           7050 |
|     3 |  379 |           1137 |
|     4 |   42 |            168 |
|     5 |   32 |            160 |

Insights:

Most words are scored +2, followed by +1. This becomes even more pronounced when scores are multiplied by counts to get weighted scores.

Let’s check the distribution of all “afinn” words:

## 
##  -5  -4  -3  -2  -1   0   1   2   3   4   5 
##  16  43 264 965 309   1 208 448 172  45   5

Words with a score of -2 dominate the lexicon, followed by words with a score of 2. We found a relatively high number of words with a score of 2 in this analysis; since this is unlikely to be only a result of their prevalence in the lexicon, we conclude it is probably an accurate reflection of the sentiment that prevails in the text.

Distribution of “afinn” Sentiment per President
## Joining, by = "word"

Insights:

The interpretation is much the same as with the “bing” lexicon, in that the trend is an increasing average sentiment over time; however, Zuma’s median sentiment is lower than the general trend.

Change in “afinn” Sentiment over time

Insights:

Mandela and Zuma show a wave-like pattern of sentiment. Mbeki shows an increasing and then decreasing pattern.

Sentiment Analysis using “nrc” lexicon (infers emotion with certain words)
## Joining, by = "word"
## Joining, by = "president"

Let’s check the distribution of all “nrc” words:

## 
##        anger anticipation      disgust         fear          joy 
##         1247          839         1058         1476          689 
##     negative     positive      sadness     surprise        trust 
##         3324         2312         1191          534         1231

Insights:

Words can be assigned more than 1 sentiment. Given the relatively low counts of “anticipation”, “joy” and “surprise” words in the lexicon, we would not expect many words to come up under those emotions, so “anticipation” has a surprisingly high relative count across all presidents.

Given that “positive” sentiment is the most frequent classification in the “nrc” lexicon, it is not surprising that it comes out as the most frequently assigned classification across all presidents. The distributions across the various sentiments are very similar for all presidents so this lexicon does not provide any insights about specific presidents.

Sentiment Analysis using “nrc” lexicon
## Joining, by = "word"

Insights:

The most used negative words that are also associated with the “anger”, “disgust”, “fear” and “sadness” emotions are: “violence”, “struggle” and “poverty”.

The most used positive words that are also associated with the “anticipation”, “joy” and “surprise” emotions are: “youth”, “public” and “progress”.

The most used words that evoke the “trust” emotion are: “system”, “president”, “parliament” and “nation”.

## Topic Modelling

An effective topic model can summarise the ideas and concepts within a document, and this can be used in various ways. A user can understand the main themes within a corpus of documents and draw conclusions from an analysis of these topics, or they can use the topics as a type of dimensionality reduction and feed them into different supervised or unsupervised algorithms.

In this project, our group used topic modelling to better understand the common topics that come up across the SONA speeches, how these relate to different presidents and speeches, and how they change over time. In addition, the probability that a sentence belongs to a certain topic was used in an attempt to classify which sentence was said by which president (see Section XX).

### Data

The data used in this section is the cleaned and processed data described in Section X. The resulting sentence data has been used and dissected further without distinguishing between training and validation sets unless otherwise stated.

### Methodology

The following methodology was followed:

  1. Each sentence was tokenised into bigrams, stop words were removed, and a document-term matrix was set up. Bigrams were chosen over individual words as they provide more context and meaning.
  2. An optimisation technique was used to help determine the number of topics covered in the corpus of documents, and this optimisation was validated on a hold-out sample.
  3. Latent Dirichlet allocation was used to determine the probability of bigrams belonging to certain topics and the probability that sentences belonged to topics.
  4. Text mining methods were deployed to extract insight into the different topics.
  5. The per-topic probabilities of each sentence were then passed through to a neural network.

#### Step One: Tokenisation, Stop-Word Removal and Document-Term Matrix

Figure: Most popular terms


After tokenisation and removal of stop words, the top 20 most used terms across all of the SONA speeches are displayed. Unsurprisingly, “South Africa” is the most used term, followed closely by “South African”, “South Africans” and “Local Government”. These terms do not add to our understanding of the topics and tend to confuse the topic modelling going forward, so removing them allows for a cleaner interpretation. “Public service” is then the most used term.

### Step 2: Optimisation of k, the number of topics

A prerequisite of topic modelling is knowing the number of topics that the corpus may contain (i.e. the latent factor k). In some cases this may be a fair assumption, but without reading through each speech, how could one know how many different topics have been articulated in the SONAs? Luckily, Murzintcev Nikita has published an R package (ldatuning) that helps to optimise the number of topics (k) over three different measures. The measures used to determine the number of topics are discussed in an RPubs paper which can be found here: link. The following optimisation largely follows the accompanying vignette: [link](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)

The following extract from the RPub paper gives a brief explanation of the methods used to optimise for k:

Extract from RPubs

"Arun2010: The measure is computed in terms of symmetric KL-Divergence of salient distributions that are derived from these matrix factor and is observed that the divergence values are higher for non-optimal number of topics (maximize)

CaoJuan2009: method of adaptively selecting the best LDA model based on density.(minimize)

Griffths: To evaluate the consequences of changing the number of topics T, used the Gibbs sampling algorithm to obtain samples from the posterior distribution over z at several choices of T(minimize)"

In addition to this, Nikita considers how the choice of k may change over a validation or hold-out sample. His term for this is “perplexity”, which he defines as: “[it] measures the log-likelihood of a held-out test set; Perplexity is a measurement of how well a probability distribution or probability model predicts a sample”.

Below is an attempt to optimise for k and to check that the choice of k holds over an unseen data set.

Figure: Optimisation Metrics


From the above plot, the marginal benefit of adding another topic stops at around 8-10 topics. To test this, the “perplexity” over a test sample of the document-term matrix can be checked.

Figure: Perplexity Plot


As more topics are used, the perplexity of the training sample decreases, but that of the test sample increases from around 11 topics. The perplexity of the test sample seems to be minimised at around 8 topics.

The evidence from these two plots suggests that the optimal number of topics sits at around 8.

### Step 3: Latent Dirichlet allocation

For this assignment, Latent Dirichlet allocation (LDA) was used for the topic modelling. Other methods, such as Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (pLSA), could have been used, but LDA is useful because it allows:

1. Each document within the corpus to be a mixture of topics
2. Each topic to be a mixture of bigrams
3. The topics to be drawn from a Dirichlet distribution (i.e. not k different distributions as with pLSA), so there are fewer parameters to estimate and no need to estimate the probability that the corpus generates a specific document

### Step 4: Extracting insights

#### Understanding the topics via the bigrams

The beta matrix gives the probability of a topic producing a given bigram (i.e. that the phrase is in reference to that topic). From this measure, one can get a sense of the character of each topic: by looking at the most popular phrases in each topic, an understanding of its flavour emerges. However, it must be kept in mind that terms can belong to more than one topic, so the logic applied to derive a theme or flavour should be applied loosely.

##### Topic One

From the display of popular terms, it can be determined that topic one has a vague connection to “job creation”. This is the most common term, and it is supported by other terms that have a high probability of being generated by this topic, such as:

- “world cup”
- “national youth”
- “infrastructure development”

These concepts all support the idea of job creation, as each of these will generate jobs for the country. But there is some noise in the topic from “address terms”, i.e. “honourable speaker” or “honourable chairperson”. “Nelson Mandela” and “President Mandela” crop up too, which suggests that alongside the job creation theme there exists some of what can be termed “terms of endearment”.

Figure: Popular terms in Topic 1

Figure: WordCloud for Topic 1

##### Topic Two

As with the previous topic, there are some random “terms of endearment” in this topic as well (i.e. “madam speaker”), but they are not as evident as in the first topic. This is to be expected, as bigrams can be generated by more than one topic since each topic is a mixture of bigrams! The next four terms sum up the main themes for this topic:

- “Economic Empowerment”
- “Black Economic”
- “Justice System”
- “Criminal Justice”

In summary, this topic can be summed up as “Economy/Criminal and Justice System”.

Figure: Popular terms in Topic 2

Figure: WordCloud for Topic 2

##### Topic Three

Despite the most popular terms being “United Nation” and “private sector”, there is a theme of “development”: development plan, resource development, national development, development programme, etc. And thus the topic is named.

Figure: Popular terms in Topic 3

Figure: WordCloud for Topic 3

##### Topic 4

Once again, there is a “term of endearment” in the popular terms (“fellow south”, which is assumed to be short for “fellow South Africans”, one of former President Zuma’s favourite phrases). With all the other terms combined, a theme of “Social Reform/Regional and Municipal Government” takes shape.

Given that there is a possible trigram evident here, it may be worth exploring trigrams in future work.

Figure: Popular terms in Topic 4

Figure: WordCloud for Topic 4

##### Topic 5

“Public sector” and “private sector” are popular terms in topic 5. After consideration of the various other terms, some of which cross over with other topics, and after discussion, the eventual name for this topic became “Public Sector Entities”.

Figure: Popular terms in Topic 5 Figure: WordCloud for Topic 5

A different way of looking at this topic is to investigate the biggest differential in terms between topics. For instance, the log (base 2) ratio between topic 1 and topic 5 shows the terms that have the widest margin between the two topics (i.e. are far more likely to appear in topic 5 than in topic 1).

Figure: Terms with the greatest log2 ratio between Topic 1 and Topic 5

For instance, “social programmes”, “human fulfilment” and “rights commission” are all generated in significantly larger proportions by Topic 5 compared to Topic 1, while “national social”, “training colleagues” and “sector unions” all sit within Topic 1.

Given the naming of Topic 5 as “Public Sector Entities” and Topic 1 as “Job Creation/Terms of Endearment” these terms do seem to be grouped in line with expectation.
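The log-ratio comparison can be sketched as follows. This is a Python illustration with made-up beta (per-topic term probability) values; the report’s actual analysis is done in R with tidytext.

```python
import math

# Hypothetical per-topic bigram probabilities (beta values); purely illustrative.
beta = {
    "social programmes": {"topic1": 0.0002, "topic5": 0.0090},
    "sector unions":     {"topic1": 0.0080, "topic5": 0.0001},
}

# log2(beta in topic 5 / beta in topic 1): strongly positive values mark terms
# far more likely to be generated by topic 5, strongly negative ones by topic 1.
log_ratio = {term: math.log2(p["topic5"] / p["topic1"]) for term, p in beta.items()}
```

Terms generated in near-equal proportions by both topics have a log ratio near zero, which is why only the extremes are informative.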

####Understanding the mixture of topics within the sentence

The LDA model allows each sentence to be represented as a mixture of topics. The gamma matrix shows the document-topic probability for each sentence, i.e. the probability that the sentence is drawn from each topic. For instance, the following sentence, sampled at random, has a 0.906 probability of being drawn from topic 4 based on the bigrams it uses. The sentence appears to be talking about water and the infrastructure around it. The label for topic 4 was “Social Reform/Regional and Municipal Government”, and this statement seems to be somewhat relevant to it.

Sample sentence showing topic probabilities:

| president | year | sentence | X1 | X2 | X3 | X4 | X5 |
|---|---|---|---|---|---|---|---|
| Zuma | 2010 | yet, we still lose a lot of water through leaking pipes and inadequate infrastructure. | 0.023575 | 0.023575 | 0.023575 | 0.9057001 | 0.023575 |

Using this method, the sentences can be roughly classified to a topic based on the probabilities (i.e. classify the sentence by the topic with the highest probability) and further analysis can be conducted.

(Note: which.is.max breaks ties at random, so where a sentence has equal probabilities it will be assigned to one of the tied topics at random.)
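The classification rule can be sketched in Python (the report uses which.is.max in R; the gamma row below is the randomly sampled sentence shown earlier, where topic 4 clearly dominates):

```python
import random

def which_is_max(probs, rng=random):
    """Index of the largest value, breaking ties at random
    (mimicking the behaviour of which.is.max in R)."""
    best = max(probs)
    return rng.choice([i for i, p in enumerate(probs) if p == best])

# Gamma row for the sampled sentence: topic 4 has probability ~0.906.
gamma_row = [0.023575, 0.023575, 0.023575, 0.9057001, 0.023575]
topic = which_is_max(gamma_row) + 1  # 1-based topic index
```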

Figure: Mixture of topics by president

Consider the mixture of topics that each individual president covers during the SONA address. Despite the imbalance in the number of sentences spoken by each president, there seems to be a fairly standard shape to the topics discussed. The two exceptions to this are de Klerk and Zuma. All other presidents tend to spend around 10-15% on Topic 1 (“Job Creation/Terms of Endearment”), 15-20% each on Topic 2 (“Economy/Criminal and Justice System”), Topic 3 (“Development”) and Topic 4 (“Social Reform/Regional and Municipal Government”), and around another 10% on Topic 5 (“Public Sector Entities”). This trend means it may be difficult for a supervised model to pick up differences between presidents based on the topics covered.

As stated, the only two presidents for whom this trend differs are President de Klerk and President Zuma. President de Klerk spent the majority of his time on Topic 1 (“Job Creation/Terms of Endearment”), followed by Topic 2 (“Economy/Criminal and Justice System”). Given the context of the time period, it may be unsurprising that “terms of endearment” and “criminal and justice systems” come up, since his speeches would be littered with the names of people and political parties as well as references to past injustices.

President Zuma spends the majority of his speeches on Topic 4 (“Social Reform/Regional and Municipal Government”). Once again, given the context that his term as President was marked by service delivery strikes, two major droughts across different regions and discussions around reform, this may be unsurprising. In fact, recalling the most popular term from Topic 4 (“fellow south”), it may even be predictable that this would be the most “talked about” topic for President Zuma. What is interesting is that, given the attention to the issues of State Capture that characterised Zuma’s presidency, his coverage of Topic 5 (“Public Sector Entities”) is much smaller than that of his peers.

A similar analysis can be done over time.

Figure: Mixture of topics over time

The graph shows that over time, Topics 1 and 5 are the least discussed topics, while Topics 2, 3 and 4 all get much the same airtime. There are a number of notable spikes/valleys:

  • In 1996, Topic 2 (“Economy/Criminal and Justice System”) spikes

    The 1996 SONA was a few months ahead of the introduction of the new constitution, as well as at the time of the start of the Truth and Reconciliation Commission. It could be suggested that these two events would drive up this topic in the SONA speech.

  • In 2005, Topic 1 (“Job Creation/Terms of Endearment”) dives while Topic 4 (“Social Reform/Regional and Municipal Government”) and Topic 2 (“Economy/Criminal and Justice System”) spike considerably

    Mbeki’s presidency (1999 - 2008) was characterised by a rise in crime, specifically farm attacks, as well as the HIV/AIDS epidemic and the start of Black Economic Empowerment in 2005, which could account for the spikes and drops in topics in 2005.

  • In 2012, Topic 2 (“Economy/Criminal and Justice System”) dives considerably

    From various media reports, Zuma’s 2012 SONA speech largely covered the successes of the government while skipping over future plans. This may be a reason why Topic 4 (“Social Reform/Regional and Municipal Government”) rises sharply.

####Step Five: Using the topic to predict the president

One of the aims behind topic modelling is to reduce the dimensions of the data to allow for other techniques to be applied. In this instance, the aim was to reduce the SONA speeches to a collection of topics that would help predict which president was responsible for a sentence in the SONA speech. The assumption was that each president might have a unique set of topics or mixture of topics that could characterise their particular speech. However, there does not seem to be evidence of this. The matrix with the probability of each sentence belonging to a topic is used in Section X and the results are discussed.

Neural Nets

Neural Net with Bag of Words Data

The input is the count of each word used in each sentence. We need to unnest the sentence data, count each word in each sentence, and spread the sentence word counts so that each row is a sentence id and each column is a word. This is the simplest neural net model we can try, so it was our first model.
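As a toy illustration of this spreading step (a Python sketch with hypothetical sentences; the report builds the matrix in R):

```python
from collections import Counter

# Two toy sentences standing in for the tokenised SONA sentences.
sentences = ["we will create jobs", "jobs and more jobs"]

# The vocabulary supplies the column ids; each row is one sentence id,
# and each cell holds the count of that word in that sentence.
vocab = sorted({w for s in sentences for w in s.split()})
bow = [[Counter(s.split())[w] for w in vocab] for s in sentences]
```

The resulting matrix is very wide and sparse (one column per unique word in the corpus), which is part of why the simple bag-of-words network struggles.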

Figure: Bag of words model - neural net training

This model has L2 regularization to avoid overfitting, but even so it didn’t help very much. The accuracy is 0.55886. The optimizer_rmsprop has a learning rate of 0.003, chosen after trying lr = c(0.001, 0.002, 0.003). To keep the report readable, only the model with the best learning rate is shown.

As we can see from the plot, the model overfits after the second iteration, since the validation loss starts increasing in value. To avoid that, let’s use a smaller model with fewer neurons and add a dropout layer.

Figure: Bag of words model - neural net training

Confusion matrix for this model (diagonal entries are correct predictions; 477/807 ≈ 0.591):

180  50  57  14   7   1
 33 181  16   8  12   2
 56  26 111  10   4   3
  7   4   3   3   0   3
  4   9   1   0   1   0
  0   0   0   0   0   1

Accuracy rate is: 0.5911.

Cohen’s Kappa

The Kappa value tells you how much better your classifier is performing than a classifier that simply guesses at random according to the frequency of each class.

“Cohen’s kappa is always less than or equal to 1. Values of 0 or less, indicate that the classifier is useless. There is no standardized way to interpret its values. Landis and Koch (1977) provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.” [Reference: Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159-174]

The Kappa value is 0.416, which means we are doing better than random.
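Kappa can be computed directly from the confusion matrix. A minimal Python sketch (illustrative only; our models and metrics were produced in R):

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix,
    where cm[i][j] counts cases of true class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n  # observed agreement (accuracy)
    # Expected agreement for a classifier guessing at random with the same
    # class frequencies (marginals) as the actual and predicted labels.
    p_e = sum((sum(cm[i]) / n) * (sum(row[i] for row in cm) / n) for i in range(k))
    return (p_o - p_e) / (1 - p_e)
```

For a toy 2-class matrix [[45, 5], [15, 35]] this gives (0.8 - 0.5) / (1 - 0.5) = 0.6.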

The accuracy is slightly better than the bigger model with no dropout (0.581). For a simple word-count model this seems good enough, but the model does not consider how important each word is to its corpus, so we should try a better representation.

After the fourth iteration the validation loss starts increasing, which is a sign of overfitting.

Neural Net with tf-idf Data

TF-IDF is a statistic that shows how important a word is to its corpus. So if we feed the NN with TF-IDF values, we logically expect the results to be slightly better than the word-count NN model.
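The statistic itself is simple to state. A Python sketch of the textbook definition (our pipeline computes it in R; the toy corpus below is made up):

```python
import math

def tf_idf(term, doc, docs):
    """Textbook tf-idf: term frequency within the document times the
    log inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(len(docs) / df)

# Toy corpus: "economy" appears in every document, so its tf-idf is 0;
# rarer words score higher, capturing their importance to a document.
docs = [["economy", "growth"], ["economy", "jobs"], ["economy", "jobs", "growth"]]
```

This is why tf-idf input helps: a word like “honourable” that appears in nearly every sentence carries almost no weight, while distinctive words dominate.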

Confusion matrix for this model (columns: predicted president; class 5 was never predicted):

        1    2    3   4   6
      183   46   51   0   0
       48  203   18   1   0
       52   17  118   1   0
       16    8   11   0   0
        5   16    3   0   0
        0    3    4   2   1
Figure: tf-idf model - neural net training

Accuracy rate is: 0.6245.

Cohen’s Kappa

The accuracy is 0.6158 and the model starts overfitting after the fourth epoch. So this is slightly better than the bag-of-words word-count model, as we expected.

Neural Net with Sentiment Analysis Data

Confusion matrix for this model (columns: predicted president; only classes 1-3 were ever predicted):

        1    2   3
      143  134   3
       70  199   1
       91   95   2
       15   20   0
        8   15   1
        3    7   0
Figure: Sentiment analysis model - neural net training

Accuracy of the sentiment analysis model is: 0.4263.

Cohen’s Kappa

The sentiment analysis model also reaches its smallest validation loss on the fifth iteration, but the training and test accuracies change only slightly at each iteration. This model does not seem to be doing well on either the training set or the test set: the test accuracy is 0.4361834 and the training accuracy is 0.4353395. If we look at the NRC sentiment lexicon, it is visible that all presidents share the same sentiment distribution pattern, which is why the model is not overfitting: our set-aside test set is effectively no different from the training set.

Neural Nets with Topic Modelling Data (Gamma Values)

Topic modelling only predicted president 1 (Mbeki) and president 2 (Zuma).

Figure: Topic modelling (gamma) model - neural net training

Cohen’s Kappa

The train and test sets are not very distinct from each other, just as with the sentiment analysis. If we look at the mixture of topics by president in the topic modelling section, we can see that the topic distributions for all the presidents are fairly uniform, making it hard to separate one president’s topics from another’s.

CNN with Transfer Learning (Pre-trained Embeddings)

We will be using GloVe embeddings. GloVe stands for “Global Vectors for Word Representation” and, as stated on their website, is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. [Reference: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf] Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia.

The following note is taken from the R implementation referenced below; as it states, the accuracy achieved by the original Python code is twice as good.

“IMPORTANT NOTE:This example does yet work correctly. The code executes fine and appears to mimic the Python code upon which it is based however it achieves only half the training accuracy that the Python code does so there is clearly a subtle difference. We need to investigate this further before formally adding to the list of examples”

[Reference for the Python implementation: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html]
[Reference for the R implementation: https://keras.rstudio.com/articles/examples/pretrained_word_embeddings.html]
[Reference for the R implementation: https://github.com/rstudio/keras/blob/master/vignettes/examples/pretrained_word_embeddings.R]

Also, for pre-trained embeddings to work well, they need to be trained on data of a similar type to the data being classified. Since the GloVe embeddings are trained on Wikipedia data, one would not necessarily expect them to help predict the presidents better for our sentences.

The majority of Mbeki’s sentences are predicted as Mandela (379/569). The majority of Zuma’s sentences are predicted as Mandela (244/534), with the second largest share predicted as himself (195/534). The majority of Mandela’s sentences are predicted as Mandela (263/381). The majority of Motlanthe’s sentences (38/66) are predicted as Mandela. The majority of Ramaphosa’s sentences are predicted as Mandela (19/47) or Zuma (15/47). The majority of de Klerk’s sentences are predicted as Mandela (14/17).

Cohen’s Kappa

Sequential Neural Networks

The bag-of-words model as applied to Neural Networks treats each sentence as an unordered list of integer or one-hot-encoded elements. This captures whether a word occurs in a sentence, and the frequency of occurrence for tf-idf models. While this can be effective, it does ignore any signals in the data related to the ordering and relative positions of words. Sequential neural networks address this problem by treating the data as an ordered list of integers using a dictionary that provides a unique mapping between words and integers. The network then applies various layers to this input that attempt to extract the sequential information for use in later standard layers.

For all our sequential neural network attempts, we converted each sentence to a vector of integers using a word dictionary as our x-data, and one-hot-encoded the presidents as our y-data.
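A minimal sketch of this encoding step (Python with toy data; the actual preprocessing was done in R):

```python
# Toy stand-ins for the SONA sentences and their labels.
sentences = ["we will create jobs", "honourable speaker"]
presidents = ["Mandela", "Zuma"]

# Word dictionary: a unique integer per word (0 left free for padding).
words = sorted({w for s in sentences for w in s.split()})
vocab = {w: i + 1 for i, w in enumerate(words)}
x = [[vocab[w] for w in s.split()] for s in sentences]

# One-hot-encode the presidents as the y-data.
classes = sorted(set(presidents))
y = [[1 if p == c else 0 for c in classes] for p in presidents]
```

Unlike the bag-of-words matrix, x preserves word order, which is exactly the signal the sequential layers are meant to exploit.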

Embeddings

An embedding layer is a dimensionality reduction technique that attempts to encode the relationships between words in a sentence, input as a variable-length integer array with padding, as a fixed-length floating point vector. An embedding has a tunable hyper-parameter for the number of latent factors to map every sentence onto, where each latent factor attempts to capture a semantic dimension of the sentence as a whole. Embeddings aim to capture the linear substructure of sentences through Euclidean distances between words in the n-dimensional unit hypercube, where n is the number of latent factors specified.
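Mechanically, an embedding is just a learned lookup table from integer word ids to vectors of latent factors. A toy Python illustration (the factor values here are made up; a real layer learns them during training):

```python
# A toy embedding table: 3 dictionary words mapped onto 2 latent factors.
embedding = {
    1: [0.1, 0.9],
    2: [0.8, 0.2],
    3: [0.4, 0.4],
}

sentence = [2, 1, 3]                         # integer-encoded words
embedded = [embedding[w] for w in sentence]  # shape: (words, latent factors)
```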

Embeddings can be trained on the corpus of sentences that comprise the dataset under investigation; however, this can prove limiting if there is a relatively small quantity of training data. An alternative approach is to re-use a previously trained embedding layer, such as the GloVe embedding. This has the advantage of leveraging the results from a much larger, and theoretically more generic, dataset in an application of transfer learning. The SONA data includes a large number of non-standard or foreign words, however, which theoretically limits the applicability of pre-trained embeddings.

Convolutional Neural Network (CNN)

A convolutional layer applies a moving weighted-average filter (kernel) over the input data that attempts to extract simple patterns for use in later layers. They are an approach to reducing the dimensionality of input data by using a shared weighting across all input nodes, thereby addressing the exploding/vanishing gradient problem that would otherwise occur with a standard fully-connected layer. By way of example, a 100-node input layer followed by a 50-node fully connected layer would have 5000 weights to fit, whereas with an equivalent convolutional layer there would only be 50 weights.
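The weight sharing can be illustrated with a minimal 1-D convolution in plain Python (illustrative only; our actual models use Keras layers):

```python
def conv1d(x, kernel):
    """Apply one shared kernel across the input with stride 1 and no padding.
    The parameter count is len(kernel), regardless of the input length."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

# A 2-weight "edge detector" kernel picks out where the signal changes,
# reusing the same 2 weights at every position of the input.
signal = [0, 0, 1, 1, 1, 0, 0]
edges = conv1d(signal, [1, -1])  # [0, -1, 0, 0, 1, 0]
```

The same two weights detect the pattern wherever it occurs, which is the source of both the parameter savings and the translation-invariance of convolutional layers.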

The convolutional layer has a number of tunable hyper-parameters, including the number of filters, the kernel size, and the stride.

Chosen structure

We experimented with a number of different network topologies, and finally settled on the following:

  • An embedding layer with 70 latent factors trained on the full word dictionary across all speeches.
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • A convolutional layer with 50 filters, a kernel size of 3, and stride of 1
  • A global max pooling layer, which assists in dimensionality reduction
  • A fully connected dense layer with 128 nodes
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • An activation layer using the “relu” function
  • A fully connected dense layer with 6 nodes, to map to our output encoding
  • An activation layer using the “softmax” function, to produce output consistent with our one-hot-encoding of presidents.

Results

Deep CNN Results

We also experimented with Deep CNN architectures by adding on additional densely connected layers (with accompanying dropout and activation layers) below the Convolutional layer. Despite much experimentation with this, additional layers did not appear to have any noticeable effect on the accuracy of our results.

Recurrent Neural Network (RNN)

A Recurrent Neural Network is an attempt to model the relationship between words in a sentence based on their relative positions. It involves repeatedly applying the same layer to each word of a sentence (rather than to the sentence as a whole), in a manner that allows the layer to “remember” aspects of the words in the sentence that have already been seen. For our application we used a long short-term memory (LSTM) layer, which trains both the weights and the memory of the layer.

Chosen structure

We experimented with a number of different network topologies, and finally settled on the following:

  • An embedding layer with 70 latent factors trained on the full word dictionary across all speeches.
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • A convolutional layer with 50 filters, a kernel size of 3, and stride of 1
  • A global max pooling layer, which assists in dimensionality reduction
  • A fully connected dense layer with 128 nodes
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • An activation layer using the “relu” function
  • A fully connected dense layer with 6 nodes, to map to our output encoding
  • An activation layer using the “softmax” function, to produce output consistent with our one-hot-encoding of presidents.

Results

RNN’s have been shown to achieve very good performance when applied to Natural Language Processing (NLP) of text, due to the similarities with how humans process language. Unfortunately our applied RNN did not surpass the performance of the other networks we attempted, despite many attempts at tuning. We suspect the sparsity of the dataset played a large role in this result, as it was replicated on both the balanced and unbalanced training data.

Neural Net (nn) to predict the President from a Sentence

Evaluate Out of Sample Performance


Convolutional Neural Net (cnn) to predict the President from a Sentence

Evaluate Out of Sample Performance

Recurrent Neural Net (rnn) to predict President from Sentence

Evaluate Out of Sample Performance

Here we can combine all the train and test accuracies in a table, as a conclusion.

Analysis of Results and Conclusion

Criticism has also been included after each chunk.